28 research outputs found

    A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

    Many interesting data sets available on the Internet are of a medium size: too big to fit into a personal computer's memory, but not so large that they won't fit comfortably on its hard disk. In the coming years, data sets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer-review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality. Comment: 30 pages, plus supplementary material

    Greater data science at baccalaureate institutions

    Donoho's JCGS (in press) paper is a spirited call to action for statisticians, who, he points out, are losing ground in the field of data science by refusing to accept that data science is its own domain. (Or, at least, a domain that is becoming distinctly defined.) He calls on writings by John Tukey, Bill Cleveland, and Leo Breiman, among others, to remind us that statisticians have been dealing with data science for years, and encourages acceptance of the direction of the field while also ensuring that statistics is tightly integrated. As faculty at baccalaureate institutions (where the growth of undergraduate statistics programs has been dramatic), we are keen to ensure statistics has a place in data science and data science education. In his paper, Donoho is primarily focused on graduate education. At our undergraduate institutions, we are considering many of the same questions. Comment: in press response to Donoho paper in Journal of Computational and Graphical Statistics

    ΔSCOPE: A New Method to Quantify 3D Biological Structures and Identify Differences in Zebrafish Forebrain Development

    Research in the life sciences has traditionally relied on the analysis of clear morphological phenotypes, which are often revealed using increasingly powerful microscopy techniques analyzed as maximum intensity projections (MIPs). However, as biology turns towards the analysis of more subtle phenotypes, MIPs and qualitative approaches are failing to adequately describe these phenotypes. To address these limitations and quantitatively analyze the three-dimensional (3D) spatial relationships of biological structures, we developed the computational method and program called ΔSCOPE (Changes in Spatial Cylindrical Coordinate Orientation using PCA Examination). Our approach uses the fluorescent signal distribution within a 3D data set and reorients the fluorescent signal to a relative biological reference structure. This approach enables quantification and statistical analysis of spatial relationships and signal density in 3D multichannel signals that are positioned around a well-defined structure contained in a reference channel. We validated the application of ΔSCOPE by analyzing normal axon and glial cell guidance in the zebrafish forebrain and by quantifying the commissural phenotypes associated with abnormal Slit guidance cue expression in the forebrain. Despite commissural phenotypes which display disruptions to the reference structure, ΔSCOPE was able to detect subtle, previously uncharacterized changes in zebrafish forebrain midline crossing axons and glia. This method has been developed as a user-friendly, open source program. We propose that ΔSCOPE is an innovative approach to advancing the state of image quantification in the field of high resolution microscopy, and that the techniques presented here are of broad applications to the life science field.

    Facilitating Team-Based Data Science: Lessons Learned from the DSC-WAV Project

    While coursework provides undergraduate data science students with some relevant analytic skills, many are not given the rich experiences with data and computing they need to be successful in the workplace. Additionally, students often have limited exposure to team-based data science and the principles and tools of collaboration that are encountered outside of school. In this paper, we describe the DSC-WAV program, an NSF-funded data science workforce development project in which teams of undergraduate sophomores and juniors work with a local non-profit organization on a data-focused problem. To help students develop a sense of agency and improve confidence in their technical and non-technical data science skills, the project promoted a team-based approach to data science, adopting several processes and tools intended to facilitate this collaboration. Evidence from the project evaluation, including participant survey and interview data, is presented to document the degree to which the project was successful in engaging students in team-based data science, and how the project changed the students' perceptions of their technical and non-technical skills. We also examine opportunities for improvement and offer insight to other data science educators who may want to implement a similar team-based approach to data science projects at their own institutions.

    openWAR: An Open Source System for Evaluating Overall Player Performance in Major League Baseball

    Within baseball analytics, there is substantial interest in comprehensive statistics intended to capture overall player performance. One such measure is Wins Above Replacement (WAR), which aggregates the contributions of a player in each facet of the game: hitting, pitching, baserunning, and fielding. However, current versions of WAR depend upon proprietary data, ad hoc methodology, and opaque calculations. We propose a competitive aggregate measure, openWAR, that is based upon public data and methodology with greater rigor and transparency. We discuss a principled standard for the nebulous concept of a "replacement" player. Finally, we use simulation-based techniques to provide interval estimates for our openWAR measure. Comment: 27 pages including supplement

    Conducting clinical trials in persons with Down syndrome : summary from the NIH INCLUDE Down syndrome clinical trials readiness working group

    The recent National Institutes of Health (NIH) INCLUDE (INvestigation of Co-occurring conditions across the Lifespan to Understand Down syndromE) initiative has bolstered capacity for the current increase in clinical trials involving individuals with Down syndrome (DS). This new NIH funding mechanism offers new opportunities to expand and develop novel approaches in engaging and effectively enrolling a broader representation of clinical trials participants addressing current medical issues faced by individuals with DS. To address this opportunity, the NIH assembled leading clinicians, scientists, and representatives of advocacy groups to review existing methods and to identify those areas where new approaches are needed to engage and prepare DS populations for participation in clinical trial research. This paper summarizes the results of the Clinical Trial Readiness Working Group that was part of the INCLUDE Project Workshop: Planning a Virtual Down Syndrome Cohort Across the Lifespan Workshop, held virtually September 23 and 24, 2019.